ABSTRACT

In this paper we present an automated method for classifying
the origin of non-native speakers. A human listener can often
identify the origin of a non-native speaker by detecting
pronunciations that are typical of each nationality. We
therefore suppose that there exist phoneme sequences that
allow the origin of non-native speakers to be classified. Our
method is based on the extraction of discriminative sequences
of phonemes from a non-native English speech database. These
sequences are used to construct a probabilistic classifier for
the speakers' origin. The existence of discriminative phone
sequences in non-native speech is a significant result of this
work. The system we have developed achieved a correct
classification rate of 96.3% and a relative error reduction of
76.9% compared to the other techniques we tested.

1. INTRODUCTION 

The problem of non-native speaker origin classification
consists in detecting the mother tongue of speakers uttering
non-native speech: for example, detecting the nationality of
Spanish or French people uttering English words. This differs
from simple origin detection, where the decision is taken over
native speech (e.g. English people uttering English speech vs.
French people uttering French speech). The issue we target
here is closer to regional accent detection within the same
native language.

With the recent advances in speech recognition, automatic
speech recognition (ASR) is increasingly used, especially in
call centers. ASR benefits both callers and call center
companies: it allows fast, automated processing of natural
language and reduces the number of human operators needed for
the repetitive task of answering the phone. Consider a car
rental call center based on automatic speech recognition. In
such a case, the ASR system interacts with the customer and
collects the details of the order, such as the car type, the
duration of the rental, the pick-up point, etc. If the origin
of a non-native speaker is known, an adapted ASR system can be
used in order to obtain better recognition accuracy. In this
application, as in a plane ticketing call center, there is a
high probability of encountering non-native speakers.

The work presented here is part of the European project
HIWIRE (Human Input that Works in Real Environments). The
project aims at developing means of helping human operators
perform their duties in real environment conditions. It
consists in developing an automated system based on ASR that
assists aircraft pilots in their tasks and communications. As
communications between pilots and control operators must be in
English, the system under construction within HIWIRE will
inherently be confronted with non-native English speech.
Traditional ASR systems would be inefficient in such a case, as
their performance drops drastically when confronted with
non-native speech. This performance drop is a well-known
problem (see [?]).

Recent research on non-native speech has already brought
significant improvements in that field. The approaches
described in [?], [?] and [4] significantly enhanced
recognition performance on non-native speech. Nevertheless,
those approaches require knowledge of the origin of the
speakers uttering the speech they are applied to. Indeed, the
modifications applied to the ASR system depend on both the
native language and the spoken language.

A foreign accent classification procedure could be a great
asset to any system based on speech recognition and confronted
with non-native speakers. Only a few articles have been
published on non-native accent classification. The approach
developed by C. Teixeira et al. [6] was based on HMM phone
models. It achieved a 65.5% classification rate on an
isolated-word database of Danish, German, British, Spanish and
Italian speech. Arslan et al. [5] used HMM phone models and
HMM word models to identify Neutral, Chinese, Turkish and
German accents. Their method achieved a 68.3% classification
rate on 5 isolated words. The approach of P. Angkitirakul et
al. [6], based on Stochastic Trajectory Models (STM) and
Parametric Trajectory Models (PTM), achieved a 40.6%
classification rate in supervised mode.

In the next section, we describe the extraction of
discriminative phone sequences for each foreign language and
sketch the decision process based on conditional
probabilities. In Section 3 we describe the tests that we have
carried out, discuss their results and outline future research
work. Finally, we end with a brief conclusion.

2. FOREIGN ACCENT DETECTION 

It is well known that non-native speakers may produce
pronunciation errors when uttering foreign speech ([1], [?],
[4], [8], [9]). These errors are due to the phonological and
articulatory properties of both the spoken and native
languages.

Some phonemes of the spoken language (SL) might not exist in
the native language (NL) of the speaker. The speaker may
replace these phonemes with acoustically close phonemes of
their NL. For instance, diphthongs do not exist in French, and
some French speakers pronounce a sequence of two French phones
instead.

These non-native pronunciation errors depend on the pair of
spoken and native languages. P. Ladefoged et al. [8] and R.J.
Jeffers et al. [9] list common phone pronunciation errors made
in English by various groups of non-native speakers (French,
Italian, Greek, etc.). Indeed, speakers of the same origin are
very likely to commit the same pronunciation errors, as they
share the same native language and thus the same articulatory
and phonological mechanisms. Human listeners rely on those
common errors as hints to decide on the origin of non-native
speakers. In the work of Arslan et al. [5] and Angkitirakul et
al. [6], human listeners achieved 54% and 84% in foreign
speech classification.

Our approach described here is based on that feature. 
We suppose the existence of discriminative uttering struc- 
tures at the phonetic level that are shared among speakers 
from a particular origin when they speak a foreign lan- 
guage. In other words, we suppose that speakers from a 
particular origin X utter some discriminative sequences of 
phonemes when they speak a foreign language Y. 

We suppose that for a set of origins $L = \{L_1, \dots, L_n\}$
and a foreign language $F$, there exist sets of phoneme
sequences $S_1, \dots, S_n$ corresponding to the origins
$L_1, \dots, L_n$ (respectively) that might discriminate the
native languages of $L_1, \dots, L_n$ speakers when they utter
$F$ speech ($S_i = \{s_{i,1}, \dots, s_{i,m_i}\}$, where the
$s_{i,m}$ are sequences of phones).



2.1. Discriminative phoneme sequences extraction

In order to better model the non-native speech, we have chosen
to use the phone acoustic models (HMM) of all the native
languages $L_1, \dots, L_n$ (the models are noted
$M_1, \dots, M_n$ respectively). Let the non-native database be
$B = \bigcup_{i=1}^{n} B_i$, where $B_i$ is the part of the
database composed of $L_i$ speakers uttering $F$ speech. First,
the native phone models are adapted on the respective
non-native database: i.e. the models $M_1$ are adapted on
$B_1$, resulting in $M'_1$, and so on. Then, to extract the
discriminative phone sequences, we perform a phonetic
recognition with the phonetic pool $M = \bigcup_{i=1}^{n} M'_i$
on each of the non-native databases. For each native language
$L_i$, we count the occurrences of all the phone sequences
having a maximum length of $max_p$ phones in the phonetic
recognition results. The number of occurrences of each phone
sequence is normalized against the number of sentences in the
corresponding non-native part of the database
($B_1, \dots, B_n$).

This processing results in sets of preliminary phone sequences
with their normalized number of occurrences for each language
$L_i$. Those sets are noted
$S'_i = \{s'_{i,1}, \dots, s'_{i,m'_i}\}$, and the number of
occurrences is noted $n_i(s)$ for a sequence $s$ (for a
language $L_i$). The next step consists in retaining the sets
of discriminative sequences $S_i$. For $L_i$, a sequence
$s \in S'_i$ is considered discriminative only if it satisfies
equation (1):

$$n_i(s) \geq \alpha \, n_k(s), \quad \forall k \neq i \qquad (1)$$

where $\alpha$ is a discriminant factor, $\alpha > 1$.

Knowing the $S_i$ sets and the occurrence counts of each of
their sequences, some probabilities can be computed. All the
following probabilities are conditioned on the acoustic models
$M$ and the sets $S_1, \dots, S_n$. For readability, we omit
these conditions in the notation. The maximum likelihood (ML)
estimates of $P(L_i)$, $P(s)$ and $P(s/L_i)$ are computed from
these counts.

Using the Bayes rule and equations (2), (3) and (4), the
conditional probability of a language $L_i$ knowing a sequence
$s$ can be computed as in equation (5):

$$P(L_i/s) = \frac{P(s/L_i) \, P(L_i)}{P(s)} \qquad (5)$$
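
The counting and selection steps of Section 2.1 can be sketched
as follows. This is a minimal illustration, not the
implementation used in this work: it assumes the phonetic
recognition output is already available as lists of phone
labels, and the function names and toy data are hypothetical.
Whether single-phone sequences are counted is not stated in the
text; the sketch includes them. The additional thresholds used
in the experiments (minimum occurrence count, maximum number of
sequences per language) are left out.

```python
from collections import Counter


def count_sequences(sentences, max_p=3):
    """Normalized count n_i(s) of every phone sequence of length <= max_p.

    `sentences` is the phonetic recognition output for one native language:
    a non-empty list of sentences, each a list of phone labels.
    """
    counts = Counter()
    for phones in sentences:
        for length in range(1, max_p + 1):
            for start in range(len(phones) - length + 1):
                counts[tuple(phones[start:start + length])] += 1
    # Normalize against the number of sentences in this part of the database.
    return {seq: c / len(sentences) for seq, c in counts.items()}


def select_discriminative(counts_by_lang, alpha=4.0):
    """For each language i, keep s with n_i(s) >= alpha * n_k(s) for all k != i."""
    selected = {}
    for lang, counts in counts_by_lang.items():
        others = [c for other, c in counts_by_lang.items() if other != lang]
        selected[lang] = {
            seq for seq, n in counts.items()
            if all(n >= alpha * o.get(seq, 0.0) for o in others)
        }
    return selected


# Toy recognition results (hypothetical phone labels).
recognized = {
    "French": [["z", "i", "s"], ["z", "i", "s", "a"]],
    "Greek": [["th", "i", "s"], ["th", "i", "s", "a"]],
}
counts = {lang: count_sequences(sents, max_p=2) for lang, sents in recognized.items()}
discriminative = select_discriminative(counts, alpha=4.0)
```

In this toy case only the sequences containing the
language-specific first phone survive the test of equation (1);
sequences shared by both groups, such as ("i", "s"), are
discarded.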



The conditional probability of a language $L_i$ knowing a list
of sequences $O = \{s_1, \dots, s_h\}$ can be computed as in
equation (6), using the Bayes rule, the equations above and the
hypothesis that the sequences of $O$ are independent. This
independence hypothesis is not true. Nevertheless, it must be
assumed in order to compute this probability: determining the
interrelations between sequences of phones might prove
impossible given the small size of our database.

$$P(L_i/O) = \frac{P(O/L_i) \, P(L_i)}{P(O)} = \frac{P(s_1, \dots, s_h/L_i) \, P(L_i)}{P(s_1, \dots, s_h)} = \frac{P(L_i) \prod_{j=1}^{h} P(s_j/L_i)}{\prod_{j=1}^{h} P(s_j)} \qquad (6)$$

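
Under the independence hypothesis, the posterior $P(L_i/O)$
above reduces to a product over the observed sequences, which
is conveniently evaluated in the log domain. The sketch below
is an illustration only: the relative-frequency estimate of
$P(s/L_i)$, the flooring constant `floor`, and all names and
toy counts are assumptions made here, not values from this
work. Since $P(O)$ is the same for every language, it is
dropped from the comparison.

```python
import math


def log_posterior_scores(observed, counts_by_lang, priors, floor=1e-6):
    """Unnormalized log P(L_i/O) = log P(L_i) + sum_j log P(s_j/L_i).

    P(s/L_i) is estimated as the relative frequency of s among the counts
    of language i (an assumption of this sketch), floored to avoid log(0).
    """
    scores = {}
    for lang, counts in counts_by_lang.items():
        total = sum(counts.values())
        score = math.log(priors[lang])
        for seq in observed:
            score += math.log(max(counts.get(seq, 0.0) / total, floor))
        scores[lang] = score
    return scores


def classify_global(observed, counts_by_lang, priors):
    """Global decision: the language with the highest posterior given the list O."""
    scores = log_posterior_scores(observed, counts_by_lang, priors)
    return max(scores, key=scores.get)


# Toy normalized counts n_i(s) for two languages (hypothetical values).
toy_counts = {
    "French": {("z", "i"): 0.9, ("i", "s"): 0.1},
    "Greek": {("z", "i"): 0.1, ("i", "s"): 0.9},
}
toy_priors = {"French": 0.5, "Greek": 0.5}
print(classify_global([("z", "i"), ("z", "i"), ("i", "s")], toy_counts, toy_priors))
```

Working with sums of logarithms rather than products avoids
numerical underflow when the list $O$ is long.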

2.2. Classification of a speaker

In order to detect the origin of a speaker X, some of their
recorded utterances must be analyzed. First, a phonetic
recognition is performed on those sentences using the models
$M$ described in Section 2.1. All the sequences of phones that
appear in the sets $S_1, \dots, S_n$ are retained in a list
$O = \{s_1, \dots, s_h\}$. The speaker X is classified in the
native language $\hat{L}$ given by equation (7):

$$\hat{L} = \arg\max_{L_i,\; i = 1..n} \{P(L_i/O)\} \qquad (7)$$

Another, local, decision approach can be adopted. Instead of
collecting all the sequences of phones from the phonetic
recognition (see the previous paragraph) that appear in
$S_1, \dots, S_n$ in a single list $O$, separate lists
$O_1, \dots, O_n$ corresponding to the sequences that appear in
$S_1, \dots, S_n$ (respectively) can be built. I.e., the list
$O_l$ contains all the sequences of phones observed in the
phonetic recognition that appear in the set $S_l$
($l = 1..n$). The decision is then made over the probabilities
of the languages knowing the lists $O_1, \dots, O_n$, i.e.
$P(L_l/O_l), \forall l = 1..n$. In this classification
approach, the probability of each language must be normalized
over the number of sequences in its list in order to allow the
comparison between them. Besides, any language decider whose
list of observations is too small should be ignored: for a list
$O_l$, if there exists $k$ such that
$\mathrm{card}(O_k) > \beta \, \mathrm{card}(O_l)$, the
classifier $L_l$ is ignored ($\beta$ is a factor). If we note
$I$ the set of language indices that are not ignored, the
speaker X is classified in the language $\hat{L}$ given by
equation (8):

$$\hat{L} = \arg\max_{L_l,\; l \in I} \{P(L_l/O_l)^{1/\mathrm{card}(O_l)}\} \qquad (8)$$

3. EXPERIMENTS

3.1. Experimental conditions

Our tests have been carried out on the HIWIRE non-native
speech database. This database has been used in the approaches
presented in [1] and [?]. It is composed of 81 speakers: 31
French, 20 Greek, 20 Italian and 10 Spanish speakers. Each of
those speakers reads 100 English sentences. The grammar used
is a strict command language composed of 134 words. This
grammar is used by aircraft pilots when communicating with
airport control agents. The speech was recorded at 16 kHz with
16 bits per sample. We chose an MFCC parametrization with 13
coefficients and their first and second time derivatives. The
acoustic models are 3-state HMMs (Hidden Markov Models) with
128 Gaussian mixtures and diagonal covariance matrices.

3.2. Tests and results

In our tests, we have used 39 French, 33 Greek, 32 Spanish and
49 Italian monophone HMM models trained on native speech
databases (respectively). As described in Section 2.1, those
models were adapted on the HIWIRE non-native database, i.e.
the French models were adapted on all the French speakers, etc.

The extraction of the phone sequences was done following the
"leave one out" scheme. For instance, when testing a French
speaker X, the discriminative phone sequences of the French
language are extracted using all the French speakers except X.
In that example, the significant sequences of the other
languages are extracted using all their respective speakers.

In our preliminary tests, we have chosen the following
threshold values:

- $\alpha = 4$: the significance factor (see Section 2.1).

- $max_p = 3$: the maximum length of a phone sequence (number
of phones in the sequence).

- 50 as the minimum occurrence count per speaker for a
sequence to be eligible as discriminative.

- 30 as the maximum number of discriminative sequences per
language.

- $\beta = 2.5$: see Section 2.2.
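
The local decision rule of Section 2.2 (equation (8)) can be
sketched as follows, with the value of $\beta$ listed above.
This is a hedged illustration: it assumes the per-language
log-probabilities $\log P(L_l/O_l)$ have already been computed,
reads the normalization as a geometric mean (the
log-probability divided by $\mathrm{card}(O_l)$), and uses
hypothetical names and toy numbers.

```python
def classify_local(log_probs, list_sizes, beta=2.5):
    """Local decision over the per-language lists O_l.

    log_probs[l]  : log P(L_l/O_l), computed on the list O_l only
    list_sizes[l] : card(O_l)
    A language l is ignored when some k has card(O_k) > beta * card(O_l).
    Raises ValueError if no sequence was observed for any language.
    """
    largest = max(list_sizes.values())
    kept = [
        lang for lang, size in list_sizes.items()
        if size > 0 and largest <= beta * size
    ]
    # Normalize each log-probability by the size of its list (geometric mean).
    return max(kept, key=lambda lang: log_probs[lang] / list_sizes[lang])


# Toy values: the Italian list is too short relative to the French one.
toy_log_probs = {"French": -12.0, "Greek": -9.0, "Italian": -1.0}
toy_sizes = {"French": 10, "Greek": 6, "Italian": 2}
decision = classify_local(toy_log_probs, toy_sizes, beta=2.5)
```

With these toy values the Italian decider would win on its
normalized score alone, but it is ignored because its list is
more than $\beta$ times smaller than the largest one.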

Table 1 shows the confusion matrix, in percentage of speakers,
for the global decision method. The accuracy achieved is
96.29% of correct speaker classification. Table 2 shows the
confusion matrix, in percentage of speakers, for the local
decision method. The accuracy achieved is 87.65% of correct
speaker classification.
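
As a consistency check, the overall rates can be related to the
speaker counts of Section 3.1 (31 French, 20 Greek, 20 Italian
and 10 Spanish speakers). The short computation below is only
illustrative arithmetic; the numbers of correctly classified
speakers are inferred here from the reported percentages and
are not stated explicitly in the text.

```python
speakers = {"French": 31, "Greek": 20, "Italian": 20, "Spanish": 10}

# Per-language correct rates (%) of the global decision method (diagonal of Table 1).
global_rates = {"French": 100.0, "Greek": 95.0, "Italian": 95.0, "Spanish": 90.0}

total = sum(speakers.values())
correct_global = sum(round(speakers[l] * global_rates[l] / 100) for l in speakers)
accuracy_global = 100 * correct_global / total

# Number of speakers implied by the 87.65% rate of the local decision method.
correct_local = round(87.65 * total / 100)
accuracy_local = 100 * correct_local / total

print(f"global: {correct_global}/{total} = {accuracy_global:.3f}%")
print(f"local:  {correct_local}/{total} = {accuracy_local:.3f}%")
```

The global rate corresponds to 78 of the 81 speakers, i.e.
about 96.296%, which is reported as 96.29% here and as 96.3%
elsewhere in the paper; the local rate corresponds to 71 of the
81 speakers.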

Table 1. Confusion matrix for the global decision method.
The classification rate is 96.29%.

            French    Greek   Italian   Spanish
French       100.0      0.0       0.0       0.0
Greek          5.0     95.0       0.0       0.0
Italian        0.0      5.0      95.0       0.0
Spanish        0.0      0.0      10.0      90.0

Table 2. Confusion matrix for the local decision method.
The classification rate is 87.65%.

            French    Greek   Italian   Spanish
French        87.1      0.0       0.0      12.9
Greek          5.0     85.0       5.0       5.0
Italian        5.0      5.0      85.0       5.0
Spanish        0.0      0.0       0.0     100.0

3.3. Discussion and future work

The tests presented here are only preliminary and we intend to
further investigate and tune the parameters of this method.
The effect of some parameters, such as the significance factor
$\alpha$, the maximum number of sequences per language and the
minimum occurrence count for sequences, will be investigated
in our future work. Besides, we will test the use of English
acoustic models adapted on non-native speech instead of native
models. This might avoid the inconvenience of collecting the
acoustic models of every native language to be classified.

The potential of native language classification based on
discriminative phone sequences might be great. Indeed, we have
tested other mother tongue detection techniques (over
non-native speech) inspired by the state of the art. We have
tested a global GMM classification where a GMM was trained for
each native language. We have also tested three HMM-based
approaches using TIMIT context-independent phonemes. We
adapted the TIMIT phone models on the French, Greek, Spanish
and Italian databases in a supervised fashion. The best result
obtained with those methods is only 84% on the same HIWIRE
database. The global decision approach described above
achieved a significantly better result of 96.3%, giving a
relative error reduction of 76.9%.
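
For reference, the relative error reduction follows directly
from the two classification rates quoted above, taking the
error rate as 100% minus the classification rate:

```latex
\[
\frac{(100 - 84) - (100 - 96.3)}{100 - 84}
  = \frac{16 - 3.7}{16}
  = \frac{12.3}{16}
  \approx 0.769
\]
```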

Another significant result of our work is the existence of
discriminative phone sequences (or syllabic realizations) in
non-native speech. Those phone sequences can be relied on to
classify the origin of non-native speakers.

4. CONCLUSION



In this paper, we presented a novel approach for the detection
of the mother tongue of non-native speakers based on
discriminative phone sequences. We have determined that there
exist discriminative phone sequences in non-native speech that
can help in mother tongue detection. The preliminary results
we obtained show a great potential for this technique: a 96.3%
correct classification rate. Our method will be further tested
and tuned for the best classification results.



